Exposition and Analysis of a Suffix Sorting Algorithm
Abstract
This paper focuses on the suffix sorting algorithm of Maniscalco [10], which at the time of writing is available only as C++ source code on the Internet. We will refer to the program as MSufSort. MSufSort computes the Inverse Suffix Array (ISA) of an input string, which is equivalent to computing the Suffix Array (converting one to the other is discussed in Section 8). Recall that for i ∈ [0..n − 1], ISA[i] gives the lexicographic rank of the suffix x[i..n − 1] amongst all the other suffixes of the string x[0..n − 1]. Experiments summarized in [10] suggest that MSufSort outperforms the fastest known suffix sorting programs, while using little extra space aside from the 4n bytes to hold the suffix array and the n bytes for the input string (in the terms of [11] it would be lightweight). It is also purported to perform well on periodic strings, which are known to be catastrophic worst cases for some algorithms. This paper addresses the need for a more formal examination of what appears to be a very robust suffix sorter. We examine and describe the inner workings of the algorithm, and try to explain why MSufSort performs well by analyzing its asymptotic behavior. As published in [10], the MSufSort source code spans several classes and files and is not easy to absorb in a single sitting. The code presented in this paper constitutes a complete rewrite of the original as just a few C functions, and is included not as an optimization, but rather to aid explanation of the approach. After introducing some notation in Section 2, the basic algorithm is described in Section 3 before two powerful heuristics are introduced in Sections 4 and 5. Sections 6 and 7 consider time and space usage. Section 8 discusses ISA to SA transformation. In Section 9 we extensively test MSufSort and compare its performance to that of other leading suffix sorters. Possible areas for future work are outlined in Sections 10 and 11, and brief conclusions are offered in Section 12.
It is assumed the reader is familiar with the concept of suffix sorting and its applications, particularly the Burrows-Wheeler Transformation (BWT).
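To make the ISA definition and the ISA/SA equivalence concrete, the following is a minimal C sketch. It is not MSufSort and is not taken from [10]: it builds the suffix array by naively sorting suffixes with qsort, then converts between SA and ISA with a single pass each way. The function names (naive_suffix_array, sa_to_isa, isa_to_sa), the use of 0-based ranks, and the use of a separate output array for the conversion are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustration only: a naive suffix sorter, NOT the MSufSort algorithm.
 * All names below are hypothetical. Ranks are assumed to be 0-based. */

static const unsigned char *g_text; /* text seen by the qsort comparator */
static int g_n;                     /* length of that text */

/* Compare suffixes x[i..n-1] and x[j..n-1] lexicographically. */
static int cmp_suffix(const void *a, const void *b)
{
    int i = *(const int *)a, j = *(const int *)b;
    int li = g_n - i, lj = g_n - j;
    int m = li < lj ? li : lj;
    int c = memcmp(g_text + i, g_text + j, (size_t)m);
    if (c != 0)
        return c;
    return li - lj; /* a suffix that is a proper prefix sorts first */
}

/* sa[r] = start position of the suffix with rank r.
 * Worst case is far from linear; fine for small examples only. */
void naive_suffix_array(const unsigned char *x, int n, int *sa)
{
    for (int i = 0; i < n; i++)
        sa[i] = i;
    g_text = x;
    g_n = n;
    qsort(sa, (size_t)n, sizeof *sa, cmp_suffix);
}

/* isa[i] = rank of the suffix starting at position i. */
void sa_to_isa(const int *sa, int n, int *isa)
{
    for (int r = 0; r < n; r++)
        isa[sa[r]] = r;
}

/* Inverse of the above: SA and ISA are inverse permutations.
 * This sketch writes to a second array rather than working in place. */
void isa_to_sa(const int *isa, int n, int *sa)
{
    for (int i = 0; i < n; i++) {
        assert(isa[i] >= 0 && isa[i] < n);
        sa[isa[i]] = i;
    }
}
```

For x = "banana" (n = 6), the sorted suffixes are a, ana, anana, banana, na, nana, so this sketch yields SA = [5, 3, 1, 0, 4, 2] and ISA = [3, 2, 5, 1, 4, 0]. Applying isa_to_sa to that ISA recovers the same SA, which is the sense in which computing one array is equivalent to computing the other.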
Publication date: 2005